Abstract

I describe how to use machine learning techniques to create a generative, videorealistic
speech animation module. A human subject is first recorded using a video camera as
he/she utters a pre-determined speech corpus. After the corpus is processed automatically,
a visual speech module is learned from the data that is capable of synthesizing
the human subject’s mouth uttering entirely novel utterances that were not recorded
in the original video. The synthesized utterance is re-composited onto a background
sequence which contains natural head and eye movement. The final output is videorealistic
in the sense that it looks like a video camera recording of the subject. At run
time, the input to the system can be either real audio sequences or synthetic audio
produced by a text-to-speech system, as long as they have been phonetically aligned.

The  two  key  contributions  of  this  work  are 

• a variant of the multidimensional morphable model (MMM) [4] [26] [25] to synthesize
new, previously unseen mouth configurations from a small set of mouth
image prototypes,

•  a  trajectory  synthesis  technique  based  on  regularization,  which  is  automatically 
trained  from  the  recorded  video  corpus,  and  which  is  capable  of  synthesizing 
trajectories  in  MMM  space  corresponding  to  any  desired  utterance. 

Results  are  presented  on  a  series  of  numerical  and  psychophysical  experiments 
designed  to  evaluate  the  synthetic  animations. 

Thesis  Supervisor:  Tomaso  Poggio 
Title:  Whitaker  Professor 
Chapter 1


Introduction 


Is  it  possible  to  record  a  human  subject  with  a  video  camera,  process  the  recorded  data 
automatically,  and  then  re-animate  that  subject  uttering  entirely  novel  utterances 
which  were  not  included  in  the  original  corpus?  In  this  work,  we  present  such  a 
technique  for  achieving  videorealistic  speech  animation. 

We  choose  to  focus  our  efforts  in  this  work  on  the  issues  related  to  the  synthesis  of 
novel  video,  and  not  on  novel  audio  synthesis.  Thus,  novel  audio  needs  to  be  provided 
as  input  to  our  system.  This  audio  can  be  either  real  human  audio  (from  the  same 
subject  or  a  different  subject),  or  synthetic  audio  produced  by  a  text-to-speech  system. 
All  that  is  required  by  our  system  is  that  the  audio  be  phonetically  transcribed  and 
aligned.  In  the  case  of  synthetic  audio  from  TTS  systems,  this  phonetic  alignment  is 
readily  available  from  the  TTS  system  itself  [7].  In  the  case  of  real  audio,  publicly 
available  phonetic  alignment  systems  [24]  may  be  used. 

Our visual speech processing system is composed of two modules. The first module
is the multidimensional morphable model (MMM), which is capable of morphing
between a small set of prototype mouth images to synthesize new, previously unseen
mouth configurations. The second module is a trajectory synthesis module, which
uses regularization [21] [42] to synthesize smooth trajectories in MMM space for any
specified utterance. The parameters of the trajectory synthesis module are trained
automatically from the recorded corpus using gradient descent learning.

Recording the video corpus takes on the order of 15 minutes. Processing of the
corpus takes on the order of several days, but, apart from the specification of head
and eye masks shown in Figure 5-1, is fully automatic, requiring no intervention on
the part of the user. The final visual speech synthesis module consists of a small set of
prototype images (46 images in the case presented here) extracted from the recorded
corpus and used to synthesize all novel sequences.

Figure 1-1: Some of the synthetic facial configurations output by our system.

Application  scenarios  for  videorealistic  speech  animation  include:  user-interface 
agents  for  desktops,  TVs,  or  cell-phones;  digital  actors  in  movies;  virtual  avatars  in 
chatrooms;  very  low  bitrate  coding  schemes  (such  as  MPEG4);  and  studies  of  visual 
speech  production  and  perception.  The  recorded  subjects  can  be  regular  people, 
celebrities,  ex-presidents,  or  infamous  terrorists. 

In the following chapter, we begin by reviewing the relevant prior work and
motivating our approach.
Chapter 2


Background 

2.1  Facial  Modeling 

One approach to facial modeling is to model the face using 3D modeling methods.
Parke  [33]  was  one  of  the  earliest  to  adopt  such  an  approach  by  creating  a  polygonal 
facial  model.  To  increase  the  visual  realism  of  the  underlying  facial  model,  the  facial 
geometry  is  frequently  scanned  in  using  Cyberware  laser  scanners.  Additionally,  a 
texture-map  of  the  face  extracted  by  the  Cyberware  scanner  may  be  mapped  onto 
the  three-dimensional  geometry  [29].  Guenter  [22]  demonstrated  recent  attempts 
at  obtaining  3D  face  geometry  from  multiple  photographs  using  photogrammetric 
techniques.  Pighin  et  al.  [35]  captured  face  geometry  and  textures  by  fitting  a  generic 
face  model  to  a  number  of  photographs.  Blanz  and  Vetter  [9]  demonstrated  how  a 
large  database  of  Cyberware  scans  may  be  morphed  to  obtain  face  geometry  from  a 
single  photograph. 

An alternative to the 3D modeling approach is to model the talking face using
image-based techniques, where the talking facial model is constructed using a collection
of example images captured of the human subject. These methods have the
potential of achieving very high levels of videorealism, and are inspired by the recent
success of similar sample-based methods for audio speech synthesis [32].

Image-based facial animation techniques need to solve the video generation problem:
How does one build a generative model of novel video that is simultaneously
photorealistic, videorealistic, and parsimonious? Photorealism means that the novel
generated images exhibit the correct visual structure of the lips, teeth, and tongue.
Videorealism means that the generated sequences exhibit the correct motion, dynamics,
and coarticulation effects [16]. Parsimony means that the generative model is
represented compactly using a few parameters.

2.1.1  Video  Rewrite 

Bregler,  Covell,  and  Slaney  [12]  describe  an  image-based  facial  animation  system 
called  Video  Rewrite  in  which  the  video  generation  problem  is  addressed  by  breaking 
down  the  recorded  video  corpus  into  a  set  of  smaller  audiovisual  basis  units.  Each 
one  of  these  short  sequences  is  a  triphone  segment,  and  a  large  database  with  all  the 
acquired triphones is built. A new audiovisual sentence is constructed by concatenating
the appropriate triphone sequences from the database together. Photorealism in
Video  Rewrite  is  addressed  by  only  using  recorded  sequences  to  generate  the  novel 
video.  Videorealism  is  achieved  by  using  triphone  contexts  to  model  coarticulation 
effects.  In  order  to  handle  all  the  possible  triphone  contexts,  however,  the  system 
requires  a  library  with  tens  and  possibly  hundreds  of  thousands  of  subsequences, 
which  seems  to  be  an  overly-redundant  and  non-parsimonious  sampling  of  human  lip 
configurations.  Parsimony  is  thus  sacrificed  in  favor  of  videorealism. 

Essentially,  Video  Rewrite  adopts  a  decidedly  agnostic  approach  to  animation: 
since  it  does  not  have  the  capacity  to  generate  novel  lip  imagery  from  a  few  recorded 
images,  it  relies  on  the  re-sequencing  of  a  vast  amount  of  original  video.  Since  it 
does  not  have  the  capacity  to  model  how  the  mouth  moves,  it  relies  on  sampling  the 
dynamics  of  the  mouth  using  triphone  segments. 

2.1.2  Multidimensional  Morphable  Models 

This work takes a different approach to solving the video generation problem, one
which has the capacity to generate novel video from a small number of examples as
well as the capacity to model how the mouth moves. This approach is
based on the use of a multidimensional morphable model (MMM), which is capable of
multidimensional morphing between various lip images to synthesize new, previously
unseen lip configurations. MMMs have already been introduced in other works [36]
[4] [17] [25] [28] [9] [8]. In this work, we develop an MMM variant and show its utility
for facial animation.

MMMs are powerful models of image appearance because they combine the power
of  vector  space  representations  with  the  realism  of  morphing  as  a  generative  image 
technique.  Prototype  example  images  of  the  mouth  are  decomposed  into  pixel  flow 
and  pixel  appearance  axes  that  represent  basis  vectors  of  image  variation.  These 
basis  vectors  are  combined  in  a  multidimensional  fashion  to  produce  novel,  realistic, 
previously  unseen  lip  configurations. 

As  such,  an  MMM  is  more  powerful  than  other  vector  space  representations  of 
images  which  do  not  model  pixel  flow  explicitly.  Cosatto  and  Graf  [19],  for  example, 
describe  an  approach  which  is  similar  to  ours,  except  that  their  generative  model 
involved  simple  pixel  blending  of  images,  which  fails  to  produce  realistic  transitions 
between  mouth  configurations. 

An  MMM  is  also  more  powerful  than  simple  1-dimensional  morphing  between  2 
image  end-points  [2],  as  well  as  techniques  such  as  those  of  Scott,  Kagels,  et  al.  [38] 
[44]  and  Ezzat  and  Poggio  [20],  which  morphed  between  several  visemes  in  pairwise 
fashion.  By  embedding  the  prototype  images  in  a  vector  space,  an  MMM  is  capable  of 
generating  smooth  curves  through  lip  space  which  handle  complex  speech  animation 
effects  in  a  non-ad-hoc  manner. 


2.2  Speech  Animation 

Speech animation techniques have traditionally included both keyframing methods
and physics-based methods, and have been extended more recently to include machine
learning methods. In keyframing, the animator specifies particular key-frames,
and the system generates intermediate values [33] [34] [16] [30]. In physics-based
methods, the animator relies on the laws of physics to determine the mouth movement,
given some initial conditions and a set of forces for all time. This technique,
which requires modeling the underlying facial muscles and skin, was demonstrated
quite effectively by [43] [29]. Finally, machine learning methods are a new class of
animation tools which are trained from recorded data and then used to synthesize new
motion. Examples include hidden Markov models (HMMs), which were demonstrated
effectively for speech animation by [10] [31] [13].

Speech animation needs to solve several problems simultaneously. Firstly, the animation
needs to have the correct motion, in the sense that the appropriate phonemic
targets need to be realized by the moving mouth. Secondly, the animation needs
to be smooth, not exhibiting any unnecessary jerks. Thirdly, it needs to display the
correct dynamics: plosives such as b and p need to occur fast. Finally, speech animation
needs to display the correct coarticulation effects, which determine the effects of
neighboring phonemes on the current phoneme shape.

In this work, we present a trajectory synthesis module to address the issues of synthesizing
mouth trajectories with correct motion, smoothness, dynamics, and coarticulation
effects. This module maps from an input stream of phonemes (with their
respective  frame  durations)  to  a  trajectory  of  MMM  shape-appearance  parameters. 
This  trajectory  is  then  fed  into  the  MMM  to  synthesize  the  final  visual  stream  that 
represents  the  talking  face. 

Unlike Video Rewrite [12], which relies on an exhaustive sampling of triphone
segments to model phonetic contexts, coarticulation effects in our system emerge directly
from our speech model. Each phoneme in our model is represented as a localized
Gaussian target region in MMM space with a particular position and covariance. The
covariance of each phoneme acts as a spring whose tension pulls the trajectory towards
each phonetic region with a force proportional to observed coarticulation effects
in the data.

However, unlike Massaro and Cohen [16] (who also modeled coarticulation using
localized Gaussian-like regions), our model of coarticulation is not hand-tuned, but
rather trained from the recorded corpus itself using a gradient descent learning procedure.
The training process determines the position and shape of the phonetic regions
in MMM space in a manner which optimally reconstructs the recorded corpus data.
Chapter 3


System  Overview 


Figure 3-1: An overview of our videorealistic facial animation system, which consists
of an Analysis stage and a Synthesis stage.

An  overview  of  our  system  is  shown  in  Figure  3-1.  After  recording  the  corpus 
(Section  4),  analysis  is  performed  to  produce  the  final  visual  speech  module.  Analysis 
itself  consists  of  three  sub-steps:  First,  the  corpus  is  pre-processed  (Section  5)  to  align 
the  audio  and  normalize  the  images  to  remove  head  movement.  Next,  the  MMM  is 
created  from  the  images  in  the  corpus  (Section  6.2).  Finally,  the  corpus  sequences 
are  analyzed  to  produce  the  phonetic  models  used  by  the  trajectory  synthesis  module 
(Sections  6.4  and  7.2). 

Given a novel audio stream that is phonetically aligned, synthesis proceeds in three
steps: First, the trajectory synthesis module is used to synthesize the trajectory in
MMM space using the trained phonetic models (Section 7). Second, the MMM is
used to synthesize the novel visual stream from the trajectory parameters (Section
6.3). Finally, the post-processing stage composites the novel mouth movement onto
a background sequence containing natural eye and head movements (Section 8).
Chapter 4


Corpus 


An  audiovisual  corpus  of  a  human  subject  uttering  various  utterances  was  recorded. 
Recording  was  performed  at  a  TV  studio  against  a  blue  “chroma-key”  background 
with a standard Sony analog TV camera. The data was subsequently digitized at
a 29.97 fps NTSC frame rate with an image resolution of 640 by 480 and an audio
sampling rate of 44.1 kHz. The final sequences were stored as QuickTime sequences
compressed using a Sorenson coder. The recorded corpus lasts for 15 minutes, and is
composed of approximately 30,000 frames.

The recorded corpus consisted of 1-syllable and 2-syllable words, such as “bed”
and “dagger”. A total of 152 1-syllable words and 156 2-syllable words were
recorded. In addition, the corpus included 105 short sentences, such as “The statue
was closed to tourists Sunday”. The subject was asked to utter all sentences
in a neutral expression. In addition, the sentences themselves were designed to elicit
no emotions from the subject.
Chapter 5


Pre-Processing 


The  recorded  corpus  data  needs  to  be  pre-processed  in  several  ways  before  it  may  be 
processed  effectively  for  re-animation. 

5.1  Audio  Alignment 

Firstly, the audio needs to be phonetically aligned in order to be able to associate
a phoneme with each image in the corpus. We perform audio alignment on all the
recorded sequences using the CMU Sphinx system [24], which is publicly available.
Given an audio sequence and an associated text transcript of the speech being uttered,
alignment systems use forced Viterbi search to find the optimal start and end of
phonemes for the given audio sequence. The alignment task is easier than the speech
recognition task because the text of the audio being uttered is known a priori.
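
As a minimal sketch of how aligned phoneme boundaries can be associated with video frames, the following converts start and end times into per-frame labels at the 29.97 fps digitization rate. The `(phoneme, start, end)` tuple format, the function name `label_frames`, and the `"sil"` default label are assumptions made for illustration; they are not the output format of any particular aligner.

```python
FPS = 29.97  # NTSC frame rate at which the corpus was digitized


def label_frames(alignment, num_frames, fps=FPS, silence="sil"):
    """Assign one phoneme label to each video frame.

    alignment: list of (phoneme, start_sec, end_sec) tuples (hypothetical
    format, standing in for whatever the aligner reports).
    Frames not covered by any phoneme keep the `silence` label.
    """
    labels = [silence] * num_frames
    for phoneme, start, end in alignment:
        first = int(round(start * fps))
        last = int(round(end * fps))  # exclusive end frame
        for f in range(first, min(last, num_frames)):
            labels[f] = phoneme
    return labels


# Toy alignment for the word "bed".
alignment = [("b", 0.00, 0.10), ("eh", 0.10, 0.30), ("d", 0.30, 0.40)]
labels = label_frames(alignment, num_frames=13)
```

Rounding the boundaries to the nearest frame is one possible convention; other conventions (for example, labeling a frame by the phoneme active at its midpoint) would work equally well for this purpose.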

5.2  Head  Movement  Normalization 

Secondly, each image in the corpus needs to be normalized so that the only movement
occurring in the entire frame is the mouth movement associated with speech. Although
the subject was instructed to keep her head steady during recording, residual
head movement nevertheless still exists in the final recorded sequences. Since the
head motion is small, we make the simplifying assumption that it can be approximated
as the perspective motion of a plane lying on the surface of the face. Planar
perspective deformations [45] have 8 degrees of freedom, and can be inferred using 4
corresponding points between a reference frame and the current frame. We employ
optical flow [23] [1] [3] to extract correspondences for 640x480 pixels, and use least
squares to solve the overdetermined system of equations to obtain the 8 parameters of
the perspective warp. Among the 640x480 correspondences, only those lying within
the head mask shown in Figure 5-1 are used. Pixels from the background area are not
used because they do not exhibit any motion at all, and those from the mouth area
are not used because they exhibit non-rigid motion associated with speech.

Figure 5-1: The head, mouth, eye, and background masks used in the pre-processing
and post-processing steps. Specification of these masks is the only manual step required
by this system.

After computing the 8 planar perspective parameters, the image is warped towards
the reference frame. This is performed for all images in the corpus. After warping,
the images are cropped to a dimension of 624x420 to eliminate the border artifacts
associated with the warp.
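
The least-squares estimate of the 8 perspective parameters can be sketched as follows. This is an illustrative linearization, not necessarily the exact formulation used in this work: each correspondence $(x, y) \rightarrow (u, v)$ contributes two linear equations in the 8 unknowns once the projective denominator is multiplied through. The function names and the synthetic test data are assumptions.

```python
import numpy as np


def fit_perspective(src, dst):
    """Least-squares fit of the 8 planar perspective parameters.

    src, dst: (M, 2) arrays of corresponding (x, y) points, M >= 4.
    Returns a 3x3 matrix H whose bottom-right entry is fixed to 1.
    """
    x, y = src[:, 0], src[:, 1]
    u, v = dst[:, 0], dst[:, 1]
    zeros, ones = np.zeros_like(x), np.ones_like(x)
    # u * (h7*x + h8*y + 1) = h1*x + h2*y + h3, and similarly for v.
    rows_u = np.stack([x, y, ones, zeros, zeros, zeros, -x * u, -y * u], axis=1)
    rows_v = np.stack([zeros, zeros, zeros, x, y, ones, -x * v, -y * v], axis=1)
    A = np.concatenate([rows_u, rows_v], axis=0)
    b = np.concatenate([u, v])
    h, *_ = np.linalg.lstsq(A, b, rcond=None)
    return np.append(h, 1.0).reshape(3, 3)


def apply_perspective(H, pts):
    """Map (M, 2) points through the perspective transform H."""
    homog = np.column_stack([pts, np.ones(len(pts))]) @ H.T
    return homog[:, :2] / homog[:, 2:3]


# Synthetic check: recover a known small head motion from 200 correspondences.
rng = np.random.default_rng(0)
H_true = np.array([[1.01, 0.02, 3.0],
                   [-0.01, 0.99, -2.0],
                   [1e-5, -2e-5, 1.0]])
src = rng.uniform(0, 640, size=(200, 2))
dst = apply_perspective(H_true, src)
H_est = fit_perspective(src, dst)
```

In the setting described above, `src` and `dst` would hold only the pixel positions inside the head mask and their optical-flow targets, so the system is heavily overdetermined.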

The images in the corpus also exhibit residual eye movement and eye blinks which
need to be removed. An eye mask is created (see Figure 5-1) which allows just the
eyes from a single frame to be pasted onto the rest of the corpus imagery. The eye
mask is blurred at the edges to allow a seamless blend between the pasted eyes and
the rest of the face.
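
The eye-pasting step amounts to alpha blending with a softened mask. The sketch below assumes a simple separable box blur to feather the mask edges; the actual blur used in this work is not specified here, and the names `box_blur` and `paste_eyes` are hypothetical.

```python
import numpy as np


def box_blur(mask, radius):
    """Separable box blur (an assumed stand-in for the mask-edge blur)."""
    kernel = np.ones(2 * radius + 1) / (2 * radius + 1)
    smooth = lambda line: np.convolve(line, kernel, mode="same")
    blurred = np.apply_along_axis(smooth, 1, mask)
    return np.apply_along_axis(smooth, 0, blurred)


def paste_eyes(frame, eye_frame, eye_mask, radius=3):
    """Blend the eyes of `eye_frame` into `frame` using a feathered mask.

    frame, eye_frame: (H, W, 3) float images; eye_mask: (H, W) binary mask.
    """
    alpha = box_blur(eye_mask.astype(float), radius)[..., None]
    return alpha * eye_frame + (1.0 - alpha) * frame


# Toy example: black frame, white "eye" frame, square mask.
frame = np.zeros((20, 20, 3))
eye_frame = np.ones((20, 20, 3))
eye_mask = np.zeros((20, 20))
eye_mask[5:15, 5:15] = 1
blended = paste_eyes(frame, eye_frame, eye_mask)
```

Pixels well inside the mask come entirely from the eye frame, pixels far outside come entirely from the original frame, and pixels near the mask boundary are a weighted mixture of the two.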
Chapter 6


Multidimensional  Morphable 
Models 


At the heart of our visual speech synthesis approach is the multidimensional morphable
model representation, which is a generative model of video capable of morphing
between various lip images to synthesize new, previously unseen lip configurations.

The basic underlying assumption of the MMM is that the complete set of mouth
images associated with human speech lies in a low-dimensional space whose axes
represent mouth appearance variation and mouth shape variation. Mouth appearance
is represented in the MMM as a set of prototype images extracted from the recorded
corpus. Mouth shape is represented in the MMM as a set of optical flow vectors [23]
computed automatically from the recorded corpus. In the work presented here, 46
images are extracted and 46 optical flow correspondences are computed. The
low-dimensional MMM space is parameterized by shape parameters $\alpha$ and appearance
parameters $\beta$.

The MMM may be viewed as a “black box” capable of performing two tasks:
Firstly, given as input a set of parameters $(\alpha, \beta)$, the MMM is capable of synthesizing
an image of the subject’s face with that shape-appearance configuration. Synthesis
is performed by morphing the various prototype images to produce novel, previously
unseen mouth images which correspond to the input parameters $(\alpha, \beta)$.

Conversely, the MMM can also perform analysis: given an input lip image, the
MMM computes shape and appearance parameters $(\alpha, \beta)$ that represent the position
of that input image in MMM space. In this manner, it is possible to project the entire
recorded corpus onto the constructed MMM, and produce a time series of $(\alpha_t, \beta_t)$
parameters that represent trajectories of mouth motion in MMM space. We term
this operation analyzing the recorded corpus.

In  the  following  sections,  we  describe  how  a  multidimensional  morphable  model 
is  defined,  how  it  may  be  acquired  automatically  from  a  recorded  video  corpus,  how 
it  may  be  used  for  synthesis,  and,  finally,  how  such  a  morphable  model  may  be  used 
for  analysis. 

6.1  Definition 

An MMM consists of a set of prototype images $\{I_i\}_{i=1}^{N}$ that represent the various lip
textures that will be encapsulated by the MMM. One image is designated arbitrarily
to be the reference image $I_1$.

Additionally, the MMM consists of a set of prototype flows $\{C_i\}_{i=1}^{N}$ that represent
the correspondences between the reference image $I_1$ and the other prototype images
in the MMM. The correspondence from the reference image to itself, $C_1$, is designated
to be an empty, zero, flow.

In this work, we choose to represent the correspondence maps using relative displacement
vectors:

$$C_i(p) = \{d_x^i(p), d_y^i(p)\}. \tag{6.1}$$

A pixel in image $I_1$ at position $p = (x, y)$ corresponds to a pixel in image $I_i$ at position
$(x + d_x^i(x, y),\; y + d_y^i(x, y))$.

Previous methods for computing correspondence [2] [38] [27] adopted feature-based
approaches, in which a set of high-level shape features common to both images is specified.
When it is done by hand, however, this feature specification process can become
quite tedious and complicated, especially in cases when a large amount of imagery is
involved. In this work, we make use of optical flow [23] [1] [3] algorithms to estimate
this motion. This motion is captured as a two-dimensional array of displacement
vectors, in the same exact format shown in Equation 6.1. In particular, we utilize the
coarse-to-fine, gradient-based optical flow algorithms developed by [3]. These algorithms
compute the desired flow displacements using the spatial and temporal image
derivatives. In addition, they embed the flow estimation procedure in a multiscale
pyramidal framework [14], where initial displacement estimates are obtained at coarse
resolutions, and then propagated to higher resolution levels of the pyramid.

Figure 6-1: 24 of the 46 image prototypes included in the MMM. The reference image
is the top left frame.
6.2  Building  an  MMM

An MMM must be constructed automatically from a recorded corpus of images
$\{I_j\}_{j=1}^{S}$. The two main tasks involved are to choose the image prototypes $\{I_i\}_{i=1}^{N}$ and to
compute the correspondence $\{C_i\}_{i=1}^{N}$ between them. We discuss the steps to do this
briefly below. Note that the following operations are performed on the entire face
region, although they need only be performed on the region around the mouth.

6.2.1  PCA 

For  the  purpose  of  more  efficient  processing,  principal  component  analysis  (PCA) 
is  first  performed  on  all  the  images  of  the  recorded  video  corpus.  PCA  allows  each 
image  in  the  video  corpus  to  be  represented  using  a  set  of  low-dimensional  parameters. 
This  set  of  low-dimensional  parameters  may  thus  be  easily  loaded  into  memory  and 
processed  efficiently  in  the  subsequent  clustering  and  Dijkstra  steps. 

Performing PCA using classical autocovariance methods [6], however, usually requires
loading all the images and computing a very large autocovariance matrix, which
requires a lot of memory. To avoid this, we adopt an on-line PCA method, termed
EM-PCA [37] [41], which allows us to perform PCA on the images in the corpus
without loading them all into memory. EM-PCA is iterative, requiring several iterations,
but is guaranteed to converge in the limit to the same principal components
that would be extracted from the classical autocovariance method. The EM-PCA
algorithm is typically run in this work for 10 iterations.

Performing EM-PCA produces a set of $D$ 624x472 principal components and a
matrix $\Sigma$ of eigenvalues. In this work, $D = 15$ PCA bases are retained. The images
in the video corpus are subsequently projected onto the principal components, and each
image $I_j$ is represented with a $D$-dimensional parameter vector $p_j$.
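
To illustrate how an EM-style PCA can avoid holding the whole corpus in memory, the sketch below streams the data in batches and accumulates only small sufficient statistics per iteration. It follows the standard EM iteration for PCA (E-step: project the data onto the current basis; M-step: refit the basis by least squares). The batching interface, the final rotation step, and all names are assumptions made for illustration, not a description of the implementation used in this work.

```python
import numpy as np


def em_pca(batches, dim, n_components, n_iters=10, seed=0):
    """EM-style PCA over streamed batches.

    batches: callable returning an iterable of (batch_size, dim) arrays of
    mean-subtracted image vectors; it is called once per pass over the data.
    Returns (components, eigvals): a (dim, n_components) orthonormal basis
    and the variance along each component, sorted in decreasing order.
    """
    rng = np.random.default_rng(seed)
    C = rng.standard_normal((dim, n_components))
    for _ in range(n_iters):
        YXt = np.zeros((dim, n_components))
        XXt = np.zeros((n_components, n_components))
        for Y in batches():
            # E-step: latent coordinates of this batch under the current basis.
            X = np.linalg.solve(C.T @ C, C.T @ Y.T)  # (n_components, batch)
            YXt += Y.T @ X.T
            XXt += X @ X.T
        # M-step: basis that best reconstructs the data from the latents.
        C = YXt @ np.linalg.inv(XXt)
    # C spans the principal subspace; orthonormalize it, then rotate within
    # the subspace so that the columns are ordered principal components.
    Q, _ = np.linalg.qr(C)
    cov = np.zeros((n_components, n_components))
    count = 0
    for Y in batches():
        P = Y @ Q
        cov += P.T @ P
        count += len(Y)
    eigvals, R = np.linalg.eigh(cov / count)
    order = np.argsort(eigvals)[::-1]
    return Q @ R[:, order], eigvals[order]


# Toy data: 500 "images" of dimension 30 with 3 dominant directions.
rng = np.random.default_rng(1)
latent = rng.standard_normal((500, 3)) * np.array([10.0, 5.0, 2.0])
mixing = np.linalg.qr(rng.standard_normal((30, 3)))[0]
data = latent @ mixing.T + 0.01 * rng.standard_normal((500, 30))
data -= data.mean(axis=0)
components, eigvals = em_pca(lambda: np.array_split(data, 5),
                             dim=30, n_components=3)
```

In a real corpus the `batches` callable would read image chunks from disk, so that only one chunk plus the small accumulators reside in memory at any time.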

6.2.2  K-means  Clustering 

Selection of the prototype images is performed using k-means clustering [6]. The algorithm
is applied directly on the $\{p_j\}_{j=1}^{S}$ low-dimensional PCA parameters, producing
$N$ cluster centers. Typically the cluster centers extracted by k-means clustering do
not coincide with actual image datapoints, so the nearest images in the dataset to
the computed cluster centers are chosen to be the final image prototypes $\{I_i\}_{i=1}^{N}$ for
use in our MMM.

It should be noted that k-means clustering requires the use of an internal distance
metric with which to compare distances between datapoints and the chosen cluster
centers. In our case, since the image parameters are themselves produced by PCA,
the appropriate distance metric between two points $p_m$ and $p_n$ is the Mahalanobis
distance metric:

$$d(p_m, p_n) = (p_m - p_n)^T \Sigma^{-1} (p_m - p_n) \tag{6.2}$$

where $\Sigma$ is the afore-mentioned matrix of eigenvalues extracted by the EM-PCA
procedure.
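
A compact way to realize k-means under the metric of Equation 6.2 is to scale each PCA coordinate by the inverse square root of its eigenvalue, after which ordinary Euclidean k-means is equivalent. The sketch below does this and then snaps each cluster center to its nearest datapoint. It treats $\Sigma$ as diagonal, uses a plain Lloyd iteration with random initialization, and uses hypothetical function names; all of these are assumptions for illustration.

```python
import numpy as np


def mahalanobis_sq(pm, pn, eigvals):
    """Equation 6.2 for a diagonal matrix of eigenvalues."""
    diff = pm - pn
    return float(np.sum(diff * diff / eigvals))


def select_prototypes(params, eigvals, n_clusters, n_iters=50, seed=0):
    """Return indices of the datapoints nearest to the k-means centers.

    params: (S, D) PCA parameter vectors; eigvals: (D,) PCA eigenvalues.
    """
    rng = np.random.default_rng(seed)
    # After whitening, squared Euclidean distance equals Equation 6.2.
    white = params / np.sqrt(eigvals)
    centers = white[rng.choice(len(white), n_clusters, replace=False)]
    for _ in range(n_iters):
        d2 = ((white[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        assign = d2.argmin(axis=1)
        for c in range(n_clusters):
            members = white[assign == c]
            if len(members):
                centers[c] = members.mean(axis=0)
    # Cluster centers are not actual images, so pick the closest datapoint.
    d2 = ((white[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=0)


# Toy data: three blobs in a 2-dimensional parameter space.
rng = np.random.default_rng(2)
blob_centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
params = np.concatenate([c + rng.standard_normal((20, 2)) for c in blob_centers])
eigvals = params.var(axis=0)
prototypes = select_prototypes(params, eigvals, n_clusters=3)
```

The returned indices identify frames of the corpus, which is what allows the prototypes to be real recorded images rather than averaged ones.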

We  selected  N  =  46  image  prototypes  in  this  work,  which  are  partly  shown  in 
Figure 6-1. The top left image is the reference image $I_1$. There is nothing magical
about  our  choice  of  46  prototypes,  which  is  in  keeping  with  the  typical  number  of 
visemes  other  researchers  have  used  [38]  [20].  It  should  be  noted,  however,  that  the 
46  prototypes  have  no  explicit  relationship  to  visemes,  and  instead  form  a  simple  basis 
set  of  image  textures. 

6.2.3  Dijkstra 

After  the  N  =  46  image  prototypes  are  chosen,  the  next  step  in  building  an  MMM  is  to 
compute correspondence between the reference image $I_1$ and all the other prototypes.
Although  it  is  in  principle  possible  to  compute  direct  optical  flow  between  the  images, 
we  have  found  that  direct  application  of  optical  flow  is  not  capable  of  estimating  good 
correspondence  when  the  underlying  lip  displacements  between  images  are  greater 
than  5  pixels. 

It is possible to use flow concatenation to overcome this problem. Since the original
corpus is digitized at 29.97 fps, there are many intermediate frames that lie between
the chosen prototypes. A series of consecutive optical flow vectors between each
intermediate image and its successor may be computed and concatenated into one
large flow vector that defines the global transformation between the chosen prototypes
(see Appendix A for details on flow concatenation).

Typically,  however,  prototype  images  are  very  far  apart  in  the  recorded  visual 
corpus,  so  it  is  not  practical  to  compute  concatenated  optical  flow  between  them. 
The  repeated  concatenation  that  would  be  involved  across  the  hundreds  or  thousands 
of  intermediate  frames  leads  to  a  considerably  degraded  final  flow. 

To compute good correspondence between prototypes, a method is needed to find
a path from the reference example $I_1$ to each chosen image
prototype $I_i$ without repeated concatenation over hundreds or thousands of intermediate
frames. We accomplish this by constructing the corpus graph representation of
the corpus: A corpus graph is an $S$-by-$S$ sparse adjacency graph matrix in which each
frame in the corpus is represented as a node in a graph connected to its $k$ nearest images.
The $k$ nearest images are chosen using the k-nearest neighbors algorithm [6], and the
distance metric used is the Mahalanobis distance in Equation 6.2 applied to the PCA
parameters $p$. Thus, an image is connected in the graph to the $k$ other images that
look most similar to it. The edge weight between a frame and its neighbor is the
value of the Mahalanobis distance. We set $k = 20$ in this work.

After the corpus graph is computed, the Dijkstra shortest path algorithm [18] [40]
is used to compute the shortest path between the reference example $I_1$ and the other
chosen image prototypes $I_i$. Each shortest path produced by the Dijkstra algorithm is
a list of images from the corpus that cumulatively represent the shortest deformation
path from $I_1$ to $I_i$ as measured by the Mahalanobis distance. Concatenated flow from
$I_1$ to $I_i$ is then computed along the intermediate images produced by the Dijkstra
algorithm. Since there are 46 images, $N = 46$ correspondences $\{C_i\}_{i=1}^{N}$ are computed
in this fashion from the reference image $I_1$ to the other image prototypes $\{I_i\}_{i=1}^{N}$.
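
The corpus graph and shortest-path search can be sketched as follows. The sketch builds a directed k-nearest-neighbor graph with edge weights given by Equation 6.2 (diagonal eigenvalue matrix assumed) and runs a textbook Dijkstra search with a binary heap. Whether the graph in this work is directed or symmetrized is not stated, so the directed choice here is an assumption, as are the function names. The dense distance matrix is used only to keep the example short; for a corpus of tens of thousands of frames a sparse neighbor search would be needed.

```python
import heapq

import numpy as np


def build_corpus_graph(params, eigvals, k):
    """Connect each frame to its k nearest frames under Equation 6.2."""
    white = params / np.sqrt(eigvals)
    d2 = ((white[:, None, :] - white[None, :, :]) ** 2).sum(axis=2)
    graph = {}
    for i in range(len(white)):
        order = np.argsort(d2[i])
        neighbors = [j for j in order if j != i][:k]
        graph[i] = [(int(j), float(d2[i, j])) for j in neighbors]
    return graph


def shortest_path(graph, source, target):
    """Dijkstra's algorithm; returns the list of nodes from source to target."""
    dist = {source: 0.0}
    prev = {}
    heap = [(0.0, source)]
    done = set()
    while heap:
        d, u = heapq.heappop(heap)
        if u in done:
            continue
        done.add(u)
        if u == target:
            break
        for v, w in graph[u]:
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                prev[v] = u
                heapq.heappush(heap, (nd, v))
    if target not in dist:
        return None  # target unreachable from source
    path = [target]
    while path[-1] != source:
        path.append(prev[path[-1]])
    return path[::-1]


# Toy corpus: 10 frames whose single PCA parameter changes gradually.
params = np.arange(10.0).reshape(-1, 1)
eigvals = np.array([1.0])
graph = build_corpus_graph(params, eigvals, k=2)
path = shortest_path(graph, 0, 9)
```

In the toy example the path steps through every intermediate frame, which mirrors the intent of the method: optical flow is only ever computed between consecutive, visually similar images along the path before being concatenated.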
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Figure  6-2:  The  flow  reorientation  process:  First,  C*  is  subtracted  from  the  synthe¬ 
sized  flow  dynth  .  Second,  this  flow  vector  is  itself  forward  warped  along  C'j. 
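
The corpus graph and shortest-path search described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, the Mahalanobis distance is taken to have a diagonal covariance (the exact form of Equation 6.2 is not restated here), and edges are stored as directed links from each frame to its k nearest frames.

```python
import heapq
import math


def mahalanobis(p, q, variances):
    # Assumed diagonal-covariance form of the distance between PCA vectors.
    return math.sqrt(sum((a - b) ** 2 / v for a, b, v in zip(p, q, variances)))


def build_corpus_graph(params, variances, k):
    # Link every frame to its k nearest frames; the edge weight is the distance.
    graph = {}
    for i, p in enumerate(params):
        dists = [(mahalanobis(p, q, variances), j)
                 for j, q in enumerate(params) if j != i]
        graph[i] = [(j, d) for d, j in heapq.nsmallest(k, dists)]
    return graph


def shortest_path(graph, source, target):
    # Dijkstra's algorithm; returns the frame indices from source to target,
    # or None when the target cannot be reached.
    dist = {source: 0.0}
    prev = {}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == target:
            break
        if d > dist.get(u, math.inf):
            continue  # stale heap entry
        for v, w in graph[u]:
            nd = d + w
            if nd < dist.get(v, math.inf):
                dist[v] = nd
                prev[v] = u
                heapq.heappush(heap, (nd, v))
    if target not in dist:
        return None
    path = [target]
    while path[-1] != source:
        path.append(prev[path[-1]])
    return path[::-1]
```

The returned list plays the role of the intermediate images along which flow is concatenated; with $k = 20$ as in the text, most prototypes should be reachable from the reference frame, but the sketch returns `None` when they are not.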

6.3  Synthesis 

The goal of synthesis is to map from the multidimensional parameter space $(\alpha, \beta)$
to an image which lies at that position in MMM space. Since there are 46 correspondences,
$\alpha$ is a 46-dimensional parameter vector that controls mouth shape. Similarly, since
there are 46 image prototypes, $\beta$ is a 46-dimensional parameter vector that controls
mouth texture. The total dimensionality of $(\alpha, \beta)$ is 92.

Synthesis first proceeds by synthesizing a new correspondence $C^{synth}$ using a linear
combination of the prototype flows $C_i$:

$$C_1^{synth} = \sum_{i=1}^{N} \alpha_i C_i.$$  (6.3)

The subscript 1 in Equation 6.3 above is used to emphasize that $C_1^{synth}$ originates
from the reference image $I_1$, since all the prototype flows are taken with $I_1$ as
reference.

Forward warping may be used to push the pixels of the reference image $I_1$ along the
synthesized correspondence vector $C_1^{synth}$. Notationally, we denote the forward warping
operation as an operator $W(I, C)$ that operates on an image $I$ and a correspondence map
$C$ (see Appendix B for details on forward warping).

However, a single forward warp will not utilize the image texture from all the examples. In
order to take into account all image texture, a correspondence re-orientation procedure
first described in [5] is adopted that re-orients the synthesized correspondence vector
$C_1^{synth}$ so that it originates from each of the other example images $I_i$.
Reorientation of the synthesized flow $C_1^{synth}$ proceeds in two steps, shown
figuratively in Figure 6-2. First, $C_i$ is subtracted from the synthesized flow
$C_1^{synth}$ to yield a flow that contains the correct flow geometry, but which originates
from the reference example $I_1$ rather than the desired example image $I_i$. Secondly, to
move the flow into the correct reference frame, this flow vector is itself warped along
$C_i$. The entire re-orientation process may be denoted as follows:

$$C_i^{synth} = W(C_1^{synth} - C_i, C_i).$$  (6.4)


Re-orientation  is  performed  for  all  examples  in  the  example  set. 

The third step in synthesis is to warp the prototype images along the re-oriented flows
$C_i^{synth}$ to generate a set of $N$ warped image textures $I_i^{warped}$:

$$I_i^{warped} = W(I_i, C_i^{synth}).$$  (6.5)


The fourth and final step is to blend the warped images $I_i^{warped}$ using the $\beta$
parameters to yield the final morphed image:

$$I^{morph} = \sum_{i=1}^{N} \beta_i I_i^{warped}.$$  (6.6)

Combining Equations 6.3 through 6.6 together, our MMM synthesis may be written as follows:

$$I^{morph}(\alpha, \beta) = \sum_{i=1}^{N} \beta_i \, W\Big(I_i, \; W\Big(\sum_{j=1}^{N} \alpha_j C_j - C_i, \; C_i\Big)\Big).$$  (6.7)
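
To make the four steps of Equation 6.7 concrete, here is a minimal one-dimensional sketch in Python. It is an illustration only: images and flows are flat lists rather than 2-D arrays, the function names are hypothetical, and holes left by the forward warp are filled by copying the nearest filled neighbour, which is a cruder stand-in for the hole-filling of Appendix C.

```python
def forward_warp(values, flow):
    # 1-D forward warp W(values, flow): push values[i] to round(i + flow[i]).
    n = len(values)
    out = [None] * n
    for i in range(n):
        x = round(i + flow[i])
        if 0 <= x < n:
            out[x] = values[i]
    # Crude hole filling (assumption): copy the nearest previously filled value.
    last = next((v for v in out if v is not None), 0.0)
    for x in range(n):
        if out[x] is None:
            out[x] = last
        else:
            last = out[x]
    return out


def synthesize(alpha, beta, flows, images):
    n_proto, n_pix = len(flows), len(flows[0])
    # Eq. 6.3: synthesized flow, originating from the reference image.
    c_synth = [sum(alpha[j] * flows[j][x] for j in range(n_proto))
               for x in range(n_pix)]
    morph = [0.0] * n_pix
    for i in range(n_proto):
        if beta[i] == 0:
            continue  # prototypes with zero texture weight need no warp
        # Eq. 6.4: re-orient the flow so that it originates from prototype i.
        diff = [c_synth[x] - flows[i][x] for x in range(n_pix)]
        c_i = forward_warp(diff, flows[i])
        # Eq. 6.5: warp prototype i along its re-oriented flow.
        warped = forward_warp(images[i], c_i)
        # Eq. 6.6: accumulate the beta-weighted blend.
        for x in range(n_pix):
            morph[x] += beta[i] * warped[x]
    return morph
```

A useful sanity check on any implementation of Equation 6.7 is that setting both $\alpha$ and $\beta$ to the one-hot vector for prototype $i$ reproduces $I_i$ itself, because the re-oriented flow is then zero.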

Empirically we have found that the MMM synthesis technique is capable of surprisingly
realistic re-synthesis of lips, teeth, and tongue. However, the blending of multiple images
in the MMM for synthesis tends to blur out some of the finer details in the teeth and tongue
(see Appendix C for a discussion of synthesis blur). Shown in Figure 6-3 are some of the
synthetic images produced by our system, along with their real counterparts for comparison.

Figure 6-3: Top: Original images from our corpus. Bottom: Corresponding synthetic images
generated by our system.

6.4  Analysis 

The goal of analysis is to project the entire recorded corpus $\{I_j\}_{j=1}^{S}$ onto the
constructed MMM, and produce a time series of $(\alpha_j, \beta_j)_{j=1}^{S}$ parameters
that represent trajectories of the original mouth motion in MMM space.

One possible approach for analysis of images is to perform analysis-by-synthesis. In this
approach, used in various forms in [25] [9], the synthesis algorithm is used to synthesize
an image $I^{synth}(\alpha, \beta)$, which is then compared to the novel image using an
error metric (e.g., the L2 norm). Gradient descent is then usually performed to change the
parameters in order to minimize the error, and the synthesis process is repeated. The search
ends when a local minimum is achieved. Analysis-by-synthesis, however, is very slow when a
large number of images are involved.

In this work we choose another method that is capable of extracting the parameters
$(\alpha, \beta)$ in one iteration. In addition to the image $I^{novel}$ to be analyzed, the
method requires that the correspondence $C^{novel}$ from the reference image $I_1$ in the
MMM to the novel image $I^{novel}$ be computed beforehand. In our case, most of the novel
imagery to be analyzed will be from the recorded video corpus itself, so we employ the
Dijkstra approach discussed in Section 6.2.3 to compute good quality correspondences between
the reference image $I_1$ and $I^{novel}$.

Figure 6-4: Top: Analyzed $\alpha_i$ flow parameters computed for one image. Bottom: The
corresponding analyzed $\beta_i$ texture parameters computed for the same image. The
$\beta_i$ texture parameters are typically zero for all but a few image prototypes.

Given a novel image $I^{novel}$ and its associated correspondence $C^{novel}$, the first
step of the analysis algorithm is to estimate the parameters $\alpha$ which minimize

$$\Big\| C^{novel} - \sum_{i=1}^{N} \alpha_i C_i \Big\|.$$  (6.8)

This is solved using the pseudo-inverse:

$$\alpha = (C^T C)^{-1} C^T C^{novel}$$  (6.9)

where $C$ above is a matrix containing all the prototype correspondences $\{C_i\}_{i=1}^{N}$.
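
As a small illustration of Equation 6.9, the following Python sketch forms the normal equations $C^T C \, \alpha = C^T C^{novel}$ and solves them by Gaussian elimination. The function names are hypothetical, each flow is assumed to be flattened into a plain list of numbers, and the prototype flows are assumed to be linearly independent so that $C^T C$ is invertible.

```python
def solve(a, b):
    # Solve a x = b by Gaussian elimination with partial pivoting.
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def estimate_alpha(prototype_flows, novel_flow):
    # Eq. 6.9 via the normal equations: (C^T C) alpha = C^T C_novel.
    n = len(prototype_flows)
    ctc = [[sum(a * b for a, b in zip(prototype_flows[i], prototype_flows[j]))
            for j in range(n)] for i in range(n)]
    ctn = [sum(a * b for a, b in zip(prototype_flows[i], novel_flow))
           for i in range(n)]
    return solve(ctc, ctn)
```

If the flow from the reference image to itself (an all-zero vector) is included among the prototypes, $C^T C$ becomes singular, and a true pseudo-inverse or a least-squares solver would be needed instead of this direct solve.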

After the parameters $\alpha$ are estimated, $N$ image warps are synthesized in the same
manner as described in Section 6.3 using flow-reorientation and warping:

$$I_i^{warp} = W\Big(I_i, \; W\Big(\sum_{j=1}^{N} \alpha_j C_j - C_i, \; C_i\Big)\Big).$$  (6.10)

The final step in analysis is to estimate the values of $\beta$ as the values which minimize

$$\Big\| I^{novel} - \sum_{i=1}^{N} \beta_i I_i^{warp} \Big\| \quad \text{subject to} \quad \beta_i \geq 0 \;\; \forall i \quad \text{and} \quad \sum_{i=1}^{N} \beta_i = 1.$$  (6.11)

The non-negativity constraint above on the $\beta_i$ parameters ensures that pixel values are
not negated. The normalization constraint ensures that the $\beta_i$ parameters are computed
in a normalized manner for each frame, which prevents brightness flickering during
synthesis. The form of the imposed constraints causes the computed $\beta_i$ parameters to be
sparse (see Figure 6-4), which enables efficient synthesis by requiring only a few image
warps (instead of the complete set of 46 warps). Equation 6.11, which involves the
minimization of a quadratic cost function subject to constraints, is solved using quadratic
programming methods. In this work, we use the Matlab function quadprog.
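
For readers without access to quadprog, the constrained problem in Equation 6.11 can also be approached with projected gradient descent onto the probability simplex. The sketch below is a swapped-in technique, not the solver used in this work; the function names, the iteration count, and the step-size rule are all assumptions, and images are flattened to plain lists.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def project_to_simplex(v):
    # Euclidean projection onto {b : b_i >= 0, sum(b) = 1} (sort-based method).
    u = sorted(v, reverse=True)
    cumulative = 0.0
    theta = 0.0
    for k, u_k in enumerate(u, start=1):
        cumulative += u_k
        t = (cumulative - 1.0) / k
        if u_k - t > 0:
            theta = t  # keep the threshold of the largest valid k
    return [max(x - theta, 0.0) for x in v]


def estimate_beta(warped_images, novel_image, iterations=500):
    # Minimize 0.5 * ||novel - sum_i beta_i * warped_i||^2 over the simplex.
    n = len(warped_images)
    gram = [[dot(wi, wj) for wj in warped_images] for wi in warped_images]
    rhs = [dot(wi, novel_image) for wi in warped_images]
    # trace(G) bounds the largest eigenvalue of G, so 1/trace is a safe step.
    # Assumes at least one warped image is non-zero.
    step = 1.0 / sum(gram[i][i] for i in range(n))
    beta = [1.0 / n] * n
    for _ in range(iterations):
        grad = [dot(gram[i], beta) - rhs[i] for i in range(n)]
        beta = project_to_simplex([b - step * g for b, g in zip(beta, grad)])
    return beta
```

The projection step is what produces exact zeros in $\beta$, which mirrors the sparsity noted above; a dedicated quadratic programming solver will generally converge faster on the full 46-prototype problem.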

Each utterance in the corpus is analyzed with respect to the 92-dimensional MMM created in
Section 6.2, yielding a set of $z_t = (\alpha_t, \beta_t)$ parameters for each utterance.
Analysis takes on the order of 15 seconds per frame on a circa 1998 450 MHz Pentium II
machine. Shown in Figure 7-2 in solid blue are example analyzed trajectories for
$\alpha_{12}$ and $\beta_{28}$ computed for the word tabloid.

Chapter 7


Trajectory  Synthesis 

7.1  Overview 

The goal of trajectory synthesis is to map from an input phone stream $\{P_t\}$ to a
trajectory $y_t = (\alpha_t, \beta_t)$ of parameters in MMM space. After the parameters are
synthesized, Equation 6.7 from Section 6.3 is used to create the final visual stream that
represents the talking face.

The phone stream is a stream of phonemes $\{P_t\}$ representing the phonetic transcription
of the utterance. For example, the word one may be represented by a phone stream
$\{P_t\}_{t=1}^{15}$ = (\w\, \w\, \w\, \w\, \uh\, \uh\, \uh\, \uh\, \uh\, \uh\, \n\, \n\,
\n\, \n\, \n\). Each element in the phone stream represents one image frame. We define $T$
to be the length of the entire utterance in frames.

Since the audio is aligned, it is possible to examine all the flow and texture parameters
for any particular phoneme. Shown in Figure 7-1 are histograms for the $\alpha_1$ parameter
for the \w\, \m\, \aa\ and \ow\ phones. Evaluation of the analyzed parameters from the
corpus reveals that parameters representing the same phoneme tend to cluster in MMM space.
We represent each phoneme $p$ mathematically as a multidimensional Gaussian with mean
$\mu_p$ and diagonal covariance $\Sigma_p$. Separate means and covariances are estimated for
the flow and texture parameters$^1$.
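
As a rough sketch of how such per-phoneme Gaussians might be estimated from the aligned corpus, the Python below groups analyzed parameter vectors by phone label and computes a mean and a diagonal variance for each group. The function name is hypothetical, and dividing by the sample count (rather than the count minus one) is an assumption of this sketch.

```python
from collections import defaultdict


def phoneme_statistics(phones, params):
    # phones[t] is the phone label of frame t; params[t] is its parameter vector.
    groups = defaultdict(list)
    for phone, vec in zip(phones, params):
        groups[phone].append(vec)
    stats = {}
    for phone, vecs in groups.items():
        n, dim = len(vecs), len(vecs[0])
        mean = [sum(v[d] for v in vecs) / n for d in range(dim)]
        # Diagonal covariance: one variance per parameter dimension.
        var = [sum((v[d] - mean[d]) ** 2 for v in vecs) / n for d in range(dim)]
        stats[phone] = (mean, var)
    return stats
```

The function would be run once on the flow parameters and once on the texture parameters, since the two are modeled separately.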

$^1$ Technically, since the texture parameters are non-negative, they are best modeled using
Gamma distributions, not Gaussians. In that case, Equation 7.1 needs to be re-written for
Gamma distributions. In practice, however, we have found Gaussians to work well enough for
texture parameters.

Figure 7-1: Histograms for the $\alpha_1$ parameter for the \w\, \m\, \aa\ and \ow\ phones.

The trajectory synthesis problem is framed mathematically as a regularization problem [21]
[42]. The goal is to synthesize a trajectory $y$ which minimizes an objective function $E$
consisting of a target term and a smoothness term:

$$E = \underbrace{(y - \mu)^T D^T \Sigma^{-1} D (y - \mu)}_{\text{target term}} + \lambda \underbrace{y^T W^T W y}_{\text{smoothness}}.$$  (7.1)

The desired trajectory $y$ is a vertical concatenation of the individual $y_t = \alpha_t$
terms at each time step (or $y_t = \beta_t$, since we treat flow and texture parameters
separately):

$$y = \begin{bmatrix} y_1 \\ \vdots \\ y_T \end{bmatrix}.$$  (7.2)

The target term consists of the relevant means $\mu$ and covariances $\Sigma$ constructed
from the phone stream:

$$\mu = \begin{bmatrix} \mu_{P_1} \\ \vdots \\ \mu_{P_T} \end{bmatrix}, \qquad \Sigma = \begin{bmatrix} \Sigma_{P_1} & & \\ & \ddots & \\ & & \Sigma_{P_T} \end{bmatrix}.$$  (7.3)

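The matrix $D$ is a duration-weighting matrix which emphasizes the shorter phonemes and
de-emphasizes the longer ones, so that the objective function is not heavily skewed by the
phonemes of longer duration:

$$D = \begin{bmatrix} \sqrt{I - \frac{D_{P_1}}{T}} & & \\ & \ddots & \\ & & \sqrt{I - \frac{D_{P_T}}{T}} \end{bmatrix}.$$  (7.4)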


One possible smoothness term consists of the first order difference operator:

$$W = \begin{bmatrix} -I & I & & & \\ & -I & I & & \\ & & \ddots & \ddots & \\ & & & -I & I \end{bmatrix}.$$  (7.5)


Higher orders of smoothness are formed by repeatedly multiplying $W$ with itself: second
order $W^T W^T W W$, third order $W^T W^T W^T W W W$, and so on.

Finally, the regularizer $\lambda$ determines the trade-off between both terms.

Taking the derivative of Equation 7.1 and minimizing yields the following equation for
synthesis:

$$(D^T \Sigma^{-1} D + \lambda W^T W) \, y = D^T \Sigma^{-1} D \, \mu.$$  (7.6)
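
For completeness, the intermediate step between Equations 7.1 and 7.6 can be written out as follows. The only fact used beyond the definitions above is that $D^T \Sigma^{-1} D$ is symmetric, which holds because $\Sigma$ is a covariance matrix.

```latex
\begin{aligned}
\frac{\partial E}{\partial y}
  &= 2\, D^T \Sigma^{-1} D\, (y - \mu) + 2 \lambda\, W^T W\, y = 0 \\
\Longrightarrow \quad
(D^T \Sigma^{-1} D + \lambda W^T W)\, y
  &= D^T \Sigma^{-1} D\, \mu .
\end{aligned}
```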

Given known means $\mu$, covariances $\Sigma$, and regularizer $\lambda$, synthesis is
simply a matter of plugging them into Equation 7.6 and solving for $y$ using Gaussian
elimination. This is done separately for the flow and the texture parameters. In our
experiments a regularizer of degree four yielding multivariate additive septic splines [42]
gave satisfactory results (see next subsection).

Coarticulation effects in our system are modeled via the magnitude of the variance
$\Sigma_p$ for each phoneme. Small variance means the trajectory must pass through that
region in phoneme space, and hence neighboring phonemes have little coarticulatory effect.
On the other hand, large variance means the trajectory has a lot of flexibility in choosing
a path through a particular phonetic region, and hence it may choose to pass through regions
which are closer to a phoneme’s neighbors. The phoneme will thus experience large
coarticulatory effects.

There is no explicit model of phonetic dynamics in our system. Instead, phonetic dynamics
emerge implicitly through the interplay between the magnitude of the variance $\Sigma_p$ for
each phoneme (which determines the phoneme’s “spatial” extent), and the input phone stream
(which determines the duration in time of each phoneme). Equation 7.1 then determines the
speed through a phonetic region in a manner which balances nearness to the phoneme with
smoothness of the overall trajectory. In general, we find the trajectories speed up in
regions of small duration and small variance (e.g., plosives), while they slow down in
regions of large duration and large variance (e.g., silences).
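
The behaviour described in this section can be explored with a small Python sketch of Equation 7.6 for a single scalar parameter. Several simplifications are assumed: the smoothness operator is first order (not the fourth-order operator used in our experiments), the diagonal of $D$ is passed in by the caller as `weights`, the utterance has at least two frames, and the function name is hypothetical. With these choices the system is tridiagonal and can be solved by a specialized form of Gaussian elimination.

```python
def synthesize_trajectory(means, variances, weights, lam):
    # means[t], variances[t]: Gaussian of the phone active at frame t.
    # weights[t]: t-th diagonal entry of the duration-weighting matrix D.
    n = len(means)
    a = [w * w / v for w, v in zip(weights, variances)]  # diag of D^T S^-1 D
    # W^T W for the first order difference operator is tridiagonal with
    # diagonal (1, 2, ..., 2, 1) and off-diagonal entries of -1.
    diag = [a[t] + lam * (1.0 if t in (0, n - 1) else 2.0) for t in range(n)]
    off = [-lam] * (n - 1)
    rhs = [a[t] * means[t] for t in range(n)]
    # Forward elimination on the tridiagonal system.
    for t in range(1, n):
        f = off[t - 1] / diag[t - 1]
        diag[t] -= f * off[t - 1]
        rhs[t] -= f * rhs[t - 1]
    # Back substitution.
    y = [0.0] * n
    y[-1] = rhs[-1] / diag[-1]
    for t in range(n - 2, -1, -1):
        y[t] = (rhs[t] - off[t] * y[t + 1]) / diag[t]
    return y
```

Lowering the variance of one phone in the input pulls the solved trajectory toward that phone's mean, while raising $\lambda$ flattens the trajectory, which is the trade-off between the target and smoothness terms of Equation 7.1.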

7.2  Training 

The means $\mu_p$ and covariances $\Sigma_p$ for each phone $p$ are initialized directly
from the data using sample means and covariances. However, the sample estimates tend to
average out the mouth movement so that it looks under-articulated. As a consequence, there
is a need to adjust the means and variances to better reflect the training data.

Gradient descent learning [6] is employed to adjust the means and covariances. First, the
Euclidean error metric is chosen to represent the error between the original utterance $z$
and the synthetic utterance $y$:

$$E = (z - y)^T (z - y).$$  (7.7)

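The parameters $\{\mu_p, \Sigma_p\}$ need to be changed to minimize this objective function
$E$. The chain rule may be used to derive the relationship between $E$ and the parameters:

$$\frac{\partial E}{\partial \mu_i} = \left(\frac{\partial E}{\partial y}\right)^T \left(\frac{\partial y}{\partial \mu_i}\right)$$  (7.8)

$$\frac{\partial E}{\partial \sigma_{ij}} = \left(\frac{\partial E}{\partial y}\right)^T \left(\frac{\partial y}{\partial \sigma_{ij}}\right)$$  (7.9)

$\frac{\partial E}{\partial y}$ may be obtained from Equation 7.7:

$$\frac{\partial E}{\partial y} = -2 (z - y).$$  (7.10)

Since $y$ is defined according to Equation 7.6, we can take its derivative to compute
$\frac{\partial y}{\partial \mu_i}$ and $\frac{\partial y}{\partial \sigma_{ij}}$:

$$(D^T \Sigma^{-1} D + \lambda W^T W) \frac{\partial y}{\partial \mu_i} = D^T \Sigma^{-1} D \frac{\partial \mu}{\partial \mu_i}$$  (7.11)

$$(D^T \Sigma^{-1} D + \lambda W^T W) \frac{\partial y}{\partial \sigma_{ij}} = 2 D^T \Sigma^{-1} \frac{\partial \Sigma}{\partial \sigma_{ij}} \Sigma^{-1} D (y - \mu).$$  (7.12)

Finally, gradient descent is performed by changing the previous values of the parameters
according to the computed gradient:

$$\mu^{new} = \mu^{old} - \eta \frac{\partial E}{\partial \mu}$$  (7.13)

$$\Sigma^{new} = \Sigma^{old} - \eta \frac{\partial E}{\partial \Sigma}$$  (7.14)

Figure 7-2: Top: The analyzed trajectory for $\alpha_{12}$ (in solid blue), compared with the
synthesized trajectory for $\alpha_{12}$ before training (in green dots) and after training
(in red crosses). Bottom: Same as above, but the trajectory is for $\beta_{28}$. Both
trajectories are from the word tabloid.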

Cross-validation sessions were performed to evaluate the appropriate value of $\lambda$ and
the correct level of smoothness $W$ to use. The learning rate $\eta$ was set to 0.00001 for
all trials, and 10 iterations were performed. Comparison between batch and online updates
indicated that online updates perform better, so this method was used throughout training.
Testing was performed on a set composed of 1-syllable words, 2-syllable words, and sentences
not contained in the training set. The Euclidean norm between the synthesized trajectories
and the original trajectories was used to measure error. The results showed that the optimal
smoothness operator is fourth order and the optimal regularizer is $\lambda = 1000$. Figure
7-2 depicts synthesized trajectories for the $\alpha_{12}$ and $\beta_{28}$ parameters before
training (in green dots) and after training (in red crosses) for these optimal values of $W$
and $\lambda$.

Chapter 8


Post-Processing 


Due to the head and eye normalization that was performed during the pre-processing stage,
the final animations generated by our system exhibit movement only in the mouth region. This
leads to an unnerving “zombie”-like quality to the final animations. To address this, we
composite the synthesized mouth onto a background sequence which contains natural head and
eye movement.

8.1  Adding  Noise 

The  first  step  in  the  composition  process  is  to  add  Gaussian  noise  to  the  synthesized 
images  to  regain  the  camera  image  sensing  noise  that  is  lost  as  a  result  of  blending 
multiple  image  prototypes  in  the  MMM.  We  estimate  means  and  variances  for  this 
noise  by  computing  differences  between  original  images  and  images  synthesized  by  our 
system,  and  averaging  over  200  images.  Shown  in  Figure  8-1  at  the  top  are  estimates 
for  this  noise  for  the  R  channel.  Generally  this  process  yields  unacceptably  high  noise 
variances  around  the  mouth  region  due  to  synthesis  mismatch,  so  we  replace  these 
estimates with values from the cheek region.

Figure 8-1: Top: Estimated mean and standard deviations for the image error between original
and synthetic images. The values are unacceptably high around the mouth region, leading to
high flicker around the mouth region if the noise is sampled. Bottom: The error values after
the area around the mouth region is replaced with more acceptable values from the cheek
region.


8.2  Compositing  onto  a  Background  Sequence 

After noise is added, the synthesized sequences are composited onto the chosen background
sequence with the help of the masks shown in Figure 5-1. The head mask is first forward
warped using optical flow to fit across the head of each image of the background sequence.
Next, optical flow is computed between each background image and its corresponding synthetic
image. The synthetic image and the mouth mask from Figure 5-1 are then perspective-warped
back onto the background image. The perspective warp is estimated using only the flow
vectors lying within the background head mask. The final composite is made by pasting the
warped mouth onto the background image using the warped mouth mask. The mouth mask is
smoothed at the edges to perform a seamless blend between the background image and the
synthesized mouth. The compositing process is depicted in Figure 8-2.

Figure 8-2: The background compositing process: Top: A background sequence with natural head
and eye movement. Middle: A sequence generated from our system, with the desired mouth
movement and appropriate masking. Bottom: The final composited sequence with the desired
mouth movement, but with the natural head and eye movements of the background sequence. The
masks from Figure 5-1 are used to guide the compositing process.

Chapter 9


Computational  Issues 


To  use  our  system,  an  animator  first  provides  phonetically  annotated  audio.  The 
annotation  may  be  done  automatically  [24],  semi-automatically  using  a  text  transcript 
[24],  or  manually  [39]. 

Trajectory synthesis is performed by Equation 7.6 using the trained phonetic models. This is
done separately for the flow and the texture parameters. After the parameters are
synthesized, Equation 6.7 from Section 6.3 is used to create the visual stream with the
desired mouth movement. Typically only the image prototypes $I_i$ which are associated with
the top 10 values of $\beta_i$ are warped, which yields a considerable savings in
computation time. MMM synthesis takes on the order of 7 seconds per frame for an image
resolution of 624x472. The background compositing process adds a few extra seconds of
processing time. All times are computed on a 450 MHz Pentium II.

Chapter 10


Evaluation 


We  have  synthesized  numerous  examples  using  our  system,  spanning  the  entire  range 
of  1-syllable  words,  2-syllable  words,  short  sentences,  and  long  sentences.  In  addition, 
we  have  synthesized  songs  and  foreign  speech  examples. 

Experimentally  we  have  found  that  reducing  the  number  of  prototypes  below  30 
degrades  the  quality  of  the  final  animations.  An  open  question  is  whether  increasing 
the  number  of  prototypes  significantly  beyond  46  will  lead  to  even  higher  levels  of 
videorealism. 

In terms of corpus size, it is possible to optimize the spoken corpus so that several words
alone elicit the 46 prototypes. This would reduce the duration of the corpus from 15 minutes
to a few seconds. However, this would degrade the quality of the correspondences computed by
the Dijkstra algorithm. In addition, the phonetic training performed by our trajectory
synthesis module would degrade as well. Further systematic experiments need to be made in
order to evaluate how final performance changes with the size of the corpus.

We evaluated our results by performing three different visual “Turing tests” to see whether
human subjects can distinguish between real sequences and synthetic ones. In the first
experiment (“single presentation”), subjects were asked to view one visual sequence at a
time, and identify whether it is real or synthetic. In a similar second experiment (“fast
single presentation”), the subjects were asked to make the judgments in a fast manner while
the utterances were being presented without pauses in between. In a third experiment
(“double presentation”), the subjects were asked to view pairs of the same utterance, where
one item in the pair is real and the other is synthetic (but randomly ordered). The subjects
in this experiment were asked to identify which utterance in the pair is real, and which is
synthetic. Either 16 or 18 utterances were presented to each subject, with half being real
and half being synthetic. As seen from Table 10.1, performance in all three experiments was
close to chance level (50%) and not significantly different from it.

Experiment          # subjects   % correct   t        p<
Single pres.        22           54.3%       1.243    0.3
Fast single pres.   21           52.1%       0.619    0.5
Double pres.        22           46.6%       -0.75    0.5

Table 10.1: Levels of correct identification of real and synthetic sequences. “t” represents
the value from a standard t-test with significance level indicated in the “p<” column.

Finally,  we  also  evaluated  our  system  by  performing  intelligibility  tests  in  which 
subjects  were  asked  to  lip  read  a  set  of  natural  and  synthetic  utterances. 

Details on all experiments are forthcoming in a separate article.

Chapter 11


Further  Work 


The main limitation of our technique is the difficulty of re-compositing synthesized mouth
sequences into background sequences which involve 1) large changes in head pose, 2) changes
in lighting conditions, and 3) changes in viewpoint. All these limitations can be alleviated
by extending our approach from 2D to 3D. It is possible to envision a real-time 3D scanner
that is capable of recording a 3D video corpus of speech. Alternatively, techniques such as
those presented in [22] [35] [9] can be used to map a 2D video corpus into 3D.

The geodesic trajectory synthesis equations described by Brand et al. [10] [11] are
analogous to (and more sophisticated than) the trajectory synthesis techniques we use
(Equations 7.1 and 7.6). Although those equations require considerably more training data,
it is possible they could lead to higher levels of videorealism.

Clearly the face is used as a conduit to transmit emotion, so one possible avenue to explore
is the synthesis of speech under various emotional states. It is possible to record various
corpora under different emotional states and create MMMs for each state. During synthesis,
the appropriate MMM is selected. An open question to explore is emotional dynamics: how does
one transition from a happy MMM to a sad MMM? Additionally, there is also a need to learn
generative models of head movement and eye movement tailored for the type of speech being
synthesized.

Appendix A


Appendix:  Flow  Concatenation 


Given a series of consecutive images $I_0, I_1, \ldots, I_n$, we would like to construct the
correspondence map $C_{0(n)}$ relating $I_0$ to $I_n$. We focus on the case of the 3 images
$I_{i-1}, I_i, I_{i+1}$ since the concatenation algorithm is simply an iterative application
of this 3-frame base case. Optical flow is first computed between the consecutive frames to
yield $C_{(i-1)i}$, $C_{i(i+1)}$. Note that it is not correct to construct $C_{(i-1)(i+1)}$
as the simple addition of $C_{(i-1)i} + C_{i(i+1)}$ because the two flow fields are with
respect to two different reference images. Vector addition needs to be performed with
respect to a common origin.

Our concatenation thus proceeds in two steps: to place all vector fields in the same
reference frame, the correspondence map $C_{i(i+1)}$ itself is warped backwards [45] along
$C_{(i-1)i}$ to create $C_{i(i+1)}^{warped}$. Now $C_{i(i+1)}^{warped}$ and $C_{(i-1)i}$ are
both added to produce an approximation to the desired concatenated correspondence:

$$C_{(i-1)(i+1)} = C_{(i-1)i} + C_{i(i+1)}^{warped}.$$  (A.1)

A procedural version of our backward warp is shown in Figure A-1. BILINEAR refers to
bilinear interpolation of the 4 pixel values closest to the point (x, y).


for j = 0 ... height,
    for i = 0 ... width,
        x = i + dx(i, j);
        y = j + dy(i, j);
        I^warped(i, j) = BILINEAR(I, x, y);


Figure A-1: BACKWARD WARP algorithm
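
A runnable Python rendering of the backward warp, together with the concatenation step of Equation A.1, is sketched below. Images and flow components are lists of rows; the function names are hypothetical, and clamping sample coordinates to the image border is an assumption, since the border behaviour is not specified here.

```python
import math


def bilinear(image, x, y):
    # Interpolate the 4 pixels around the real-valued point (x, y).
    h, w = len(image), len(image[0])
    x = min(max(x, 0.0), w - 1.0)  # clamp to the border (assumption)
    y = min(max(y, 0.0), h - 1.0)
    x0, y0 = int(math.floor(x)), int(math.floor(y))
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    fx, fy = x - x0, y - y0
    top = (1 - fx) * image[y0][x0] + fx * image[y0][x1]
    bottom = (1 - fx) * image[y1][x0] + fx * image[y1][x1]
    return (1 - fy) * top + fy * bottom


def backward_warp(image, dx, dy):
    # For each destination pixel (i, j), sample the source at (i + dx, j + dy).
    h, w = len(image), len(image[0])
    return [[bilinear(image, i + dx[j][i], j + dy[j][i]) for i in range(w)]
            for j in range(h)]


def concatenate(flow_ab, flow_bc):
    # Eq. A.1: C_ac = C_ab + backward_warp(C_bc, C_ab); a flow is a (dx, dy) pair.
    dx_ab, dy_ab = flow_ab
    dx_bc, dy_bc = flow_bc
    wx = backward_warp(dx_bc, dx_ab, dy_ab)
    wy = backward_warp(dy_bc, dx_ab, dy_ab)
    h, w = len(dx_ab), len(dx_ab[0])
    dx = [[dx_ab[j][i] + wx[j][i] for i in range(w)] for j in range(h)]
    dy = [[dy_ab[j][i] + wy[j][i] for i in range(w)] for j in range(h)]
    return dx, dy
```

Applying `concatenate` repeatedly along a list of consecutive frames gives the iterative concatenation described above.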
Appendix B


Appendix:  Forward  Warping 

Forward warping may be viewed as “pushing” the pixels of an image $I$ along the computed flow
vectors $C$. We denote the forward warping operation as an operator $W(I, C)$ that operates
on an image $I$ and a correspondence map $C$, producing a warped image $I^{warped}$ as final
output. A procedural version of our forward warp is shown in Figure B-1.

It is also possible to forward warp a correspondence map $C'$ along another correspondence
$C$, which we denote as $W(C', C)$. In this scenario, the x and y components of
$C'(p) = \{d'_x(p), d'_y(p)\}$ are treated as separate images, and warped individually along
$C$: $W(d'_x, C)$ and $W(d'_y, C)$.

for j = 0 ... height,
    for i = 0 ... width,
        x = ROUND(i + α dx(i, j));
        y = ROUND(j + α dy(i, j));
        if (x, y) are within the image
            I^warped(x, y) = I(i, j);

Figure B-1: FORWARD WARP algorithm
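
The forward warp can be rendered in Python along the following lines. This is a sketch under assumptions: the function name is hypothetical, the flow is applied at full strength, destination pixels that receive no source pixel are marked with `None`, and Python's built-in `round` (which rounds exact halves to the nearest even integer) stands in for ROUND.

```python
def forward_warp(image, dx, dy):
    # Push each source pixel (i, j) to (round(i + dx), round(j + dy)).
    h, w = len(image), len(image[0])
    out = [[None] * w for _ in range(h)]  # None marks a hole
    for j in range(h):
        for i in range(w):
            x = round(i + dx[j][i])
            y = round(j + dy[j][i])
            if 0 <= x < w and 0 <= y < h:
                out[y][x] = image[j][i]
    return out


def forward_warp_flow(flow, along):
    # W(C', C): warp the x and y components of C' separately along C.
    dx, dy = along
    return forward_warp(flow[0], dx, dy), forward_warp(flow[1], dx, dy)
```

When two source pixels land on the same destination, the one visited later in raster order overwrites the earlier one, which matches the loop order of Figure B-1.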
Appendix C


Appendix:  Hole-Filling 


Forward warping produces black holes which occur in cases where a destination pixel was not
filled in with any source pixel value. This occurs due to inherent nonzero divergence in the
optical flow, particularly around the region where the mouth is expanding. To remedy this, a
hole-filling algorithm [15] was adopted which pre-fills a destination image with a special
reserved background color. After warping, the destination image is traversed in rasterized
order and the holes are filled in by interpolating linearly between their non-hole
endpoints.
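
The raster-order fill can be sketched in Python for a single image row as follows. The function name is hypothetical, `None` stands in for the reserved background color, and the treatment of runs that touch the image border (copying the single available endpoint) is an assumption, since that case is not described above.

```python
def fill_row(row):
    # Replace each run of holes (None) by a linear ramp between its endpoints.
    out = list(row)
    n = len(out)
    x = 0
    while x < n:
        if out[x] is not None:
            x += 1
            continue
        start = x  # first hole of the current run
        while x < n and out[x] is None:
            x += 1
        left = out[start - 1] if start > 0 else None
        right = out[x] if x < n else None
        if left is None and right is None:
            break  # the whole row is holes: nothing to interpolate from
        if left is None:
            left = right  # run touches the left border (assumption)
        if right is None:
            right = left  # run touches the right border (assumption)
        span = x - start + 1
        for k in range(start, x):
            t = (k - start + 1) / span
            out[k] = (1 - t) * left + t * right
    return out


def fill_holes(image):
    # Traverse the image row by row, in raster order.
    return [fill_row(row) for row in image]
```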

In the context of our synthesis algorithm in Section 6.3, hole-filling can be performed
before blending, or after blending. Throughout this paper, we assume hole-filling is
performed before blending, which allows us to subsume the hole-filling procedure into our
forward warp operator $W$ and simplify our notation. Consequently (as in Equation 6.6), the
blending operation becomes a simple linear combination of the hole-filled warped
intermediates $I_i^{warped}$.

In practice, however, we perform hole-filling after blending, which reduces the size of the
holes that need to be filled, and leads to a considerable reduction in synthesis blur.
Post-blending hole-filling requires a more complex blending algorithm than as noted in
Equation 6.6 because the blending algorithm now needs to keep track of holes and non-holes
in the warped intermediate images $I_i^{warped}$:

$$I^{morph}(x, y) = \frac{\sum_{I_i^{warped}(x,y) \neq hole} \beta_i \, I_i^{warped}(x, y)}{\sum_{I_i^{warped}(x,y) \neq hole} \beta_i}.$$  (C.1)

Typically an accumulator array is used to keep track of the denominator term in Equation C.1
above. The synthesized mouth images shown in Figure 6-3 were generated using post-blending
hole-filling.
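
A minimal Python sketch of hole-aware blending with an accumulator array is given below. The function name is hypothetical, holes are represented by `None`, and pixels that are holes in every warped intermediate are left as `None` so that a subsequent hole-filling pass can treat them.

```python
def blend_with_holes(warped_images, beta):
    # Per-pixel weighted average over the non-hole intermediates only.
    h, w = len(warped_images[0]), len(warped_images[0][0])
    numerator = [[0.0] * w for _ in range(h)]
    accumulator = [[0.0] * w for _ in range(h)]  # running sum of beta weights
    for image, b in zip(warped_images, beta):
        for y in range(h):
            for x in range(w):
                if image[y][x] is not None:
                    numerator[y][x] += b * image[y][x]
                    accumulator[y][x] += b
    return [[numerator[y][x] / accumulator[y][x] if accumulator[y][x] > 0 else None
             for x in range(w)] for y in range(h)]
```

Dividing by the accumulated weight keeps the brightness of a pixel consistent even when some of the intermediates contribute nothing at that location.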